24 research outputs found

    Integrating and Ranking Uncertain Scientific Data

    Mediator-based data integration systems resolve exploratory queries by joining data elements across sources. In the presence of uncertainties, such multiple expansions can quickly lead to spurious connections and incorrect results. The BioRank project investigates formalisms for modeling uncertainty during scientific data integration and for ranking uncertain query results. Our motivating application is protein function prediction. In this paper we show that: (i) explicit modeling of uncertainties as probabilities increases our ability to predict less-known or previously unknown functions (though it does not improve prediction of well-known functions). This suggests that probabilistic uncertainty models offer utility for scientific knowledge discovery; (ii) small perturbations in the input probabilities tend to produce only minor changes in the quality of our result rankings. This suggests that our methods are robust against slight variations in the way uncertainties are transformed into probabilities; and (iii) several techniques allow us to evaluate our probabilistic rankings efficiently. This suggests that probabilistic query evaluation is not as hard for real-world problems as theory indicates.

    A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions

    BACKGROUND: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function must also be improved. One problem in particular is overly-specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. METHODOLOGY: Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. SIGNIFICANCE: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value >1e-62, non-redundant NCBI protein database (NRDB)) and only small likelihood of specific function match for proteins with low sequence similarity (bit score <54.6, e-value <1e-05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function, and predicted functions for thousands of these proteins ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to predictions from our model, which is based on proteins with experimentally determined function.
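    The abstract above reports sequence similarity as paired BLAST bit scores and e-values. As a rough illustration of how the two statistics relate, the sketch below uses the standard Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score. The function name and the query and database lengths are hypothetical; BLAST itself uses effective (edge-corrected) lengths, which this sketch ignores, so it will not reproduce the exact thresholds quoted in the abstract.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate BLAST e-value from a bit score.

    Uses E = m * n * 2**(-S'), with m the query length and n the total
    database length. Raw lengths are used here as a simplifying assumption;
    BLAST applies effective lengths instead.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 1.5e9 database residues.
for score in (54.6, 244.7):
    print(score, expected_hits(score, 300, 1_500_000_000))
```

    For a fixed search space, each additional bit halves the expected number of chance hits, so a larger bit score always corresponds to a smaller e-value.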

    Ten new high-quality genome assemblies for diverse bioenergy sorghum genotypes

    INTRODUCTION: Sorghum (Sorghum bicolor (L.) Moench) is an agriculturally and economically important staple crop that has immense potential as a bioenergy feedstock due to its relatively high productivity on marginal lands. To capitalize on and further improve sorghum as a potential source of sustainable biofuel, it is essential to understand the genomic mechanisms underlying complex traits related to yield, composition, and environmental adaptations. METHODS: Expanding on a recently developed mapping population, we generated de novo genome assemblies for 10 parental genotypes from this population and identified a comprehensive set of over 24 thousand large structural variants (SVs) and over 10.5 million single nucleotide polymorphisms (SNPs). RESULTS: We show that SVs and nonsynonymous SNPs are enriched in different gene categories, emphasizing the need for long read sequencing in crop species to identify novel variation. Furthermore, we highlight SVs and SNPs occurring in genes and pathways with known associations to critical bioenergy-related phenotypes and characterize the landscape of genetic differences between sweet and cellulosic genotypes. DISCUSSION: These resources can be integrated into both ongoing and future mapping and trait discovery for sorghum and its myriad uses including food, feed, bioenergy, and increasingly as a carbon dioxide removal mechanism.

    Validating Annotations for Uncharacterized Proteins in Shewanella oneidensis

    Proteins of unknown function are a barrier to our understanding of molecular biology. Assigning function to these “uncharacterized” proteins is imperative, but challenging. The usual approach is similarity searches using annotation databases, which are useful for predicting function. However, since the performance of these databases on uncharacterized proteins is basically unknown, the accuracy of their predictions is suspect, making annotation difficult. To address this challenge, we developed a benchmark annotation dataset of 30 proteins in Shewanella oneidensis. The proteins in the dataset were originally uncharacterized after the initial annotation of the S. oneidensis proteome in 2002. In the intervening 5 years, the accumulation of new experimental evidence has enabled specific functions to be predicted. We utilized this benchmark dataset to evaluate several commonly utilized annotation databases. According to our criteria, six annotation databases accurately predicted functions for at least 60% of proteins in our dataset. Two of these six even had a “conditional accuracy” of 90%. Conditional accuracy is an additional evaluation metric we developed that excludes cases in which a database predicted no function. Also, 27 of the 30 proteins' functions were correctly predicted by at least one database. This represents one of the first performance evaluations of annotation databases on uncharacterized proteins. Our evaluation indicates that these databases readily incorporate new information and are accurate in predicting functions for uncharacterized proteins, provided that experimental function evidence exists.
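    The distinction between plain accuracy and the "conditional accuracy" described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the abstract does not give their correctness criteria, so exact string matching, the function names, and the toy data are all assumptions.

```python
from typing import Optional, Sequence


def accuracy(predictions: Sequence[Optional[str]], truth: Sequence[str]) -> float:
    """Fraction of ALL benchmark proteins predicted correctly.

    A missing prediction (None) counts as incorrect.
    """
    correct = sum(p is not None and p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def conditional_accuracy(predictions: Sequence[Optional[str]], truth: Sequence[str]) -> float:
    """Fraction predicted correctly among proteins that received a prediction.

    Proteins with no prediction (None) are excluded from the denominator.
    """
    answered = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    if not answered:
        return float("nan")  # the database predicted nothing at all
    return sum(p == t for p, t in answered) / len(answered)


# Hypothetical benchmark of five proteins and one database's predictions.
truth = ["kinase", "transporter", "protease", "ligase", "reductase"]
preds = ["kinase", None, "protease", "hydrolase", None]

print(accuracy(preds, truth))              # 2 correct out of 5 proteins
print(conditional_accuracy(preds, truth))  # 2 correct out of 3 predictions made
```

    A database that stays silent when unsure scores higher on conditional accuracy than on plain accuracy, which is why the two figures are reported separately.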

    The necessity of adjusting tests of protein category enrichment in discovery proteomics

    Motivation: Enrichment tests are used in high-throughput experimentation to measure the association between gene or protein expression and membership in groups or pathways. Fisher's exact test is commonly used. We specifically examined the associations produced by the Fisher test between protein identification by mass spectrometry discovery proteomics and Gene Ontology (GO) term assignments in a large yeast dataset. We found that direct application of the Fisher test is misleading in proteomics because mass spectrometry preferentially identifies proteins based on their biochemical properties. False inference about associations can be made if this bias is not corrected. Our method adjusts Fisher tests for these biases and produces associations more directly attributable to protein expression rather than experimental bias.
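    For readers unfamiliar with the unadjusted test that the abstract above criticizes, a one-sided Fisher's exact test for enrichment reduces to a hypergeometric tail probability. The sketch below shows only that baseline calculation; the bias adjustment proposed by the authors is not described in the abstract and is not reproduced here. The function name, parameter names, and counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test p-value for over-representation.

    N: proteins in the background proteome
    K: background proteins annotated with the GO term
    n: proteins identified in the experiment
    k: identified proteins annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 10 proteins in total, 4 carry the term,
# 3 were identified, and 2 of those carry the term.
print(enrichment_p_value(k=2, n=3, K=4, N=10))
```

    The test treats every protein as equally likely to be identified. That is the assumption the abstract says mass spectrometry violates, which is why the raw p-value can mislead.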

    Development of the Digital Arthritis Index, a Novel Metric to Measure Disease Parameters in a Rat Model of Rheumatoid Arthritis

    Despite a broad spectrum of anti-arthritic drugs currently on the market, there is a constant demand to develop improved therapeutic agents. Efficient compound screening and rapid evaluation of treatment efficacy in animal models of rheumatoid arthritis (RA) can accelerate the development of clinical candidates. Compound screening by evaluation of disease phenotypes in animal models facilitates preclinical research by enhancing understanding of human pathophysiology; however, there is still a continuous need to improve methods for evaluating disease. Current clinical assessment methods are challenged by the subjective nature of scoring-based methods, time-consuming longitudinal experiments, and the requirement for better functional readouts with relevance to human disease. To address these needs, we developed a low-touch, digital platform for phenotyping preclinical rodent models of disease. As a proof-of-concept, we utilized the rat collagen-induced arthritis (CIA) model of RA and developed the Digital Arthritis Index (DAI), an objective and automated behavioral metric that does not require human-animal interaction during the measurement and calculation of disease parameters. The DAI detected the development of arthritis similarly to standard in vivo methods, including ankle joint measurements and arthritis scores, and demonstrated a positive correlation with ankle joint histopathology. The DAI also determined responses to multiple standard-of-care (SOC) treatments and nine repurposed compounds predicted by the SMarTR™ Engine to have varying degrees of impact on RA. The disease profiles generated by the DAI complemented those generated by standard methods. The DAI is a highly reproducible and automated approach that can be used in conjunction with standard methods for detecting RA disease progression and conducting phenotypic drug screens.

    Volume 185: Defense, Diplomacy and Development: Making a 3D Strategy Work in the Great Lakes Region of Africa

    Created as part of the 2013 Jackson School for International Studies SIS 495: Task force. Adam Smith, Task Force Advisor; Jared Sarkis and Sarah Stauch, Coordinators. The eastern Democratic Republic of Congo (DRC) has been the site of cyclical violence for years and, as a result, the country has failed to become a developed member of the international community. The International Rescue Committee estimated that 5.4 million people died in the region between August 1998 and April 2007, with over 400,000 persons displaced. Ethnic tensions exacerbated by non-state and state actors vying for power in the absence of a strong state have fueled a series of wars and crimes against humanity. While many members of the international community have attempted to intervene in the conflict in the hopes of finding a peaceful solution, a large-scale coordinated effort has not yet been structured; the lack of synchronized support allows the conflict to continue and further harm the region.